Recently, a statistical-decision approach to the problem of voiced-
unvoiced-silence detection of speech was proposed by Atal and Rabiner.
This method was found to perform well on high-quality speech. How- 
ever, the five speech parameters used in the analysis were not found 
to be as good for telephone-quality speech. Thus, an investigation was 
undertaken to determine a suitable set of parameters that would pro- 
vide a reliable voiced-unvoiced-silence decision across a variety of 
standard telephone connections. A total of 70 parameters were
investigated, including 12 LPC coefficients, 12 correlation coefficients,
12 parcor coefficients, 12 LPC partial error terms, etc. Many of the
parameters were immediately eliminated because they provided almost
no separability between the three decision
classes. The remaining parameters were used in a knockout optimi- 
zation to determine the five best parameters to use for a voiced-un- 
voiced-silence analysis. Various error weights were investigated to see 
what types of errors occurred and how they could be minimized. Finally, 
the use of the Itakura two-pole spectral normalization was investigated 
to see its effect on the error scores. 



I. INTRODUCTION 

In a recent paper, Atal and Rabiner described a fairly sophisticated 
method for reliably classifying segments of a waveform as voiced speech, 
unvoiced speech, or silence. 1 The analysis method used a statistical 
pattern-recognition approach to make this three-class decision. In an- 
other investigation, Rabiner et al. showed that the accuracy of the 
classification algorithm was quite high when the input signal was 
wideband; however, for telephone speech inputs, the accuracy of the 
classification degraded quite significantly. 2 The reason for this result 
was not that the method inherently broke down for telephone inputs,
but instead that the particular parameter set effective for wideband 
inputs was not equally effective for band-limited inputs. Thus, the 
motivation for the work to be presented in this paper is to investigate 
the suitability of a large number of parameters as features for reliable 
voiced-unvoiced-silence classification for telephone-quality speech. 

Figure 1 shows a block diagram of the basic voiced-unvoiced-silence 
analysis algorithm. As shown in this figure, there are three steps in the 
method. First the speech is preprocessed. Generally, this preprocessing 
is a simple filtering operation; e.g., in the earlier work, a 200-Hz highpass 
filter was used to remove dc, hum, or low-frequency noise components 
present in the input signal. For telephone line inputs, we have considered 
somewhat more sophisticated preprocessing; namely, we have studied 
the use of a second-order inverse filter (as originally proposed by Ita- 
kura 3 ) to normalize out the effects of varying telephone lines. 

The second step in the algorithm is the feature measurement stage. 
For wideband inputs, only five parameters were considered, namely: 

(i) Energy of the signal 
(ii) Zero-crossing rate of the signal 
(iii) Autocorrelation coefficient at unit sample delay
(iv) First predictor coefficient 
(v) Energy of the prediction error. 

These measurements were shown to provide a high degree of separability 
between the three classes of signal for wideband inputs. 1 However, for 
telephone-quality inputs, the band-limiting of the telephone line con- 
siderably reduces the effectiveness of all of the parameters in separating 
the classes of voiced speech, unvoiced speech, and silence. For example, 
the absence of signal energy above about 3 kHz significantly reduces the 
number of zero crossings for unvoiced speech. 

To find an effective set of parameters that would be capable of reliably 
distinguishing between the three signal classes for telephone line inputs, 
a large number of parameters (70 in total) were studied. Using a set of 
training data, the probability-density functions for each of the param- 
eters were estimated. Those parameters that provided little or no sep- 
aration between voiced speech, unvoiced speech, and silence were 
eliminated from consideration. The remaining 36 parameters were 
studied as to their effectiveness in classifying telephone line inputs. A 
knockout type optimization was used to obtain the five most effective 
parameters for classifying signals according to an error-weighting 
scheme. Several combinations of different test sets of data and error 
weights were investigated. 

The final step in the analysis method of Fig. 1 is a distance computation
to determine whether a test signal is voiced, unvoiced, or silence.
For this step, the non-Euclidean distance metric of Ref. 1 was retained
because of its invariance properties to linear transformations of the
data. 4

Fig. 1. Block diagram of silence-unvoiced-voiced classification system.

Before presenting the results of the investigation, it is worthwhile 
reviewing the major distortions of telephone line signals as compared 
to wideband signals recorded with a high-quality microphone. These 
distortions include: 

(i) Band limitation: The frequency response of a telephone line is
approximately band limited between 300 Hz and 3000 Hz.
(ii) Phase distortion: For the frequency band between 300 and 3000
Hz, the magnitude of the incoming signal remains relatively flat;
however, the phase is altered significantly in this band.
(iii) Nonlinear effects: Various nonlinearities occur in telephone
transmission, including amplitude distortion (signal fading), peak
and center clipping, impulse and/or gaussian noise addition,
crosstalk, etc.

The effects of the first type of distortion are the most significant as far 
as this analysis method is concerned.* However, the other types of dis- 
tortion can, and often do, play a role in determining an effective set of 
parameters for classifying telephone line signals. 

The organization of this paper is as follows. In Section II, we present 
a description of the techniques used to determine the most effective sets 
of five parameters for classifying the incoming signals. In Section III, 
we present the results of the knockout optimization tests for each of the 
test sets of data and for each set of error weights. Finally, in Section IV, 
we compare the results on telephone inputs to those obtained with 
wideband inputs. A typical example showing how the method ultimately 
performed is presented to illustrate the types of problems that occur with 
telephone inputs. 



* In this work we are considering only those distortions that occur within a local PBX; 
thus, one would expect a minimum of phase distortion and other nonlinear effects to occur. 
The place in which such distortions can become significant is in long-distance transmis- 
sions. 

II. TELEPHONE SIGNAL ANALYSIS SYSTEM

For the preprocessing step of the analysis, a single highpass filter was 
always used to eliminate hum, dc offset, and low-frequency noise. This 
filter is described in Ref. 1. A second type of preprocessing was also in- 
vestigated: the spectrum normalization technique as originally proposed 
by Itakura. 3 In this technique, the gross long-time spectrum of the signal 
is estimated using a two-pole LPC model, and then the signal is inverse 
filtered to remove the gross spectral tilt. Using the two-pole spectral 
normalization to reduce the spectral variability should, theoretically, 
also make the feature estimates more reliable. The rationale for con- 
sidering this form of preprocessing is that for telephone speech the in- 
dividual telephone transducer and line responses vary greatly across 
different handsets and telephone lines. Thus, any features estimated 
over such varying conditions may be adversely affected by the inherent 
variability of the transmission medium. 

The way in which the two-pole spectral normalization was imple- 
mented is shown in Fig. 2. For each frame (10 ms of data), three corre- 
lation coefficients, R(m), m = 0,1,2, are computed using the relation 



$$R_j(m) = \sum_{n=0}^{N-m} s_j(n)\, s_j(n+m), \qquad m = 0, 1, 2, \tag{1}$$



where N is 100, the sampling frequency is 10 kHz, and j is a frame
counter that goes from 1 to NF, the number of frames in the utterance. 
The weighted normalized averages of the first two correlation coefficients 
(the m = 1, m = 2 terms) are computed as 



$$R(m) = \frac{1}{NC} \sum_{j=1}^{NF} \frac{R_j(m)}{R_j(0)}\, W_j(m), \qquad m = 1, 2, \tag{2}$$

where Wj(m) is a weight on the correlation function of the form 

$$W_j(m) = \begin{cases} 1 & \text{if } R_j(0) > T \\ 0 & \text{otherwise,} \end{cases} \tag{3}$$



i.e., only frames whose energy [R_j(0)] exceeds a fixed threshold T are
included in the computation of the average correlations. The factor NC
in eq. (2) is the number of frames that exceed the threshold of eq. (3).
The weighting is used in computing the average correlations to eliminate
unvoiced sounds and silence for which the correlation values are sig-
nificantly different from those for voiced frames.

Fig. 2. Block diagram of two-pole spectral-normalization system.

The third step in the normalization procedure is to compute the pre- 
dictor coefficients of a two-pole linear predictive coding (LPC) match 
to the long-time average gross spectrum. If we denote the two LPC
coefficients as a_1 and a_2, then the inverse filter needed to normalize the
speech spectrum has a transfer function

$$A(z) = 1 - a_1 z^{-1} - a_2 z^{-2}. \tag{4}$$

On a frame-by-frame basis the inverse filter can be applied directly to 
the autocorrelation coefficients of a high-order LPC analysis of the signal 
by convolving them with the autocorrelation coefficients of the sec- 
ond-order inverse filter. 3 
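
The normalization steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names are hypothetical, the frame correlations use only the samples inside each frame, the two predictor coefficients are obtained from the order-2 normal equations for the sign convention of eq. (4), and the inverse filter is applied directly to the waveform rather than to the autocorrelation coefficients.

```python
import numpy as np


def frame_correlations(frame, max_lag=2):
    """Correlations R_j(m), m = 0..max_lag, using samples inside the frame only."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    return np.array([np.dot(frame[: n - m], frame[m:]) for m in range(max_lag + 1)])


def average_normalized_correlations(signal, frame_len=100, threshold=0.0):
    """Average of R_j(m)/R_j(0), m = 1, 2, over frames whose energy exceeds threshold."""
    signal = np.asarray(signal, dtype=float)
    sums = np.zeros(2)
    count = 0  # plays the role of NC in eq. (2)
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        r = frame_correlations(signal[start : start + frame_len])
        if r[0] > threshold:  # weight W_j(m) of eq. (3)
            sums += r[1:] / r[0]
            count += 1
    if count == 0:
        raise ValueError("no frame exceeded the energy threshold")
    return sums / count


def two_pole_predictor(r1, r2):
    """Solve [[1, r1], [r1, 1]] [a1, a2]^T = [r1, r2]^T for A(z) = 1 - a1/z - a2/z^2."""
    den = 1.0 - r1 * r1
    return r1 * (1.0 - r2) / den, (r2 - r1 * r1) / den


def inverse_filter(signal, a1, a2):
    """Apply A(z) of eq. (4) to the waveform (a simplification, see lead-in)."""
    signal = np.asarray(signal, dtype=float)
    return np.convolve(signal, [1.0, -a1, -a2])[: len(signal)]
```

A call such as `inverse_filter(s, *two_pole_predictor(*average_normalized_correlations(s, threshold=T)))` would chain the three steps for a signal `s`; the threshold `T` has to be chosen for the recording level, since the text does not give its value.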

2.1 Features used in the analysis

The parameters (features) studied in the course of this work included 
the following: 

Parameter  Description

1-12     The LPC coefficients of a 12th-order analysis using the Burg lattice
         method with a 10-ms frame (Refs. 5, 6): a(1) to a(12).
13-24    The first 12 autocorrelation coefficients of the signal using a 10-ms
         frame: φ(0,1) to φ(0,12).
25-36    The first 12 parcor (partial correlation) coefficients of the signal:
         k(1) to k(12).
37-48    The first 12 partial normalized error coefficients of the LPC analysis:
         E(1) to E(12).
49-60    The first 12 cepstral coefficients of the signal as obtained by
         transforming the LPC coefficients: c(1) to c(12).
61       The log energy of the signal: LE.
62       The number of zero crossings per 10-ms frame: NZ.
63       The log normalized error of the 12-pole LPC analysis: LNE.
64       The maximum value minus the minimum value of the signal during the
         frame: ML.
65       The absolute energy in the first difference of the signal: ED.
66       The number of zero crossings per 10-ms frame for the first difference
         signal: NZD.
67       The maximum value minus the minimum value for the first difference
         signal: MLD.
68       The absolute energy of a smoothed version of the signal: ES.
69       The number of zero crossings per 10-ms frame for the smoothed signal:
         NZS.
70       The maximum value minus the minimum value for the smoothed signal:
         MLS.

Fig. 3. Block diagram of feature-measurement system.

Figure 3 shows the basic measurement scheme. For each 10-ms frame
of the signal, an LPC analysis was performed using the Burg lattice
method 5,6 giving a set of 12 LPC coefficients, 12 parcor coefficients, and
12 partial normalized errors defined as

(5)

i.e., E(l) is the normalized error of an l-pole LPC analysis. Since the lattice
method does not require the set of correlations directly, they are com-
puted on the signal from the equation



$$\phi(0,i) = \sum_{n=0}^{N-1} s(n)\, s(n-i), \qquad i = 1, 2, \ldots, 12, \tag{6}$$



i.e., a nonstationary correlation function is computed. The cepstral 
coefficients are computed directly from the LPC coefficients using the 
recursion relation 



$$c(i) = a(i) - \sum_{m=1}^{i-1} \frac{m}{i}\, c(m)\, a(i-m), \qquad 1 \le i \le 12. \tag{7}$$



Two other measurements are made directly on the signal s(n). These 
are the zero-crossing count defined as the number of zero crossings per 
10-ms interval, and a computation of the difference between the maxi- 
mum and minimum signal amplitudes in the frame. 

In addition to the above parameters, six additional measurements are
made on the first difference of the signal, d(n), defined as

$$d(n) = s(n) - s(n-1) \tag{8}$$

and a smoothed version of the signal obtained via the filtering relation*

$$\hat{s}(n) = -s(n) + s(n-2) + 2s(n-3) + 4s(n-4) + 4s(n-5) + 4s(n-6) + 2s(n-7) + s(n-8) - s(n-10). \tag{9}$$

Fig. 4. Frequency response of lowpass smoothing filter.

Because every coefficient in eq. (9) is ±1, 2, or 4, the filtering can be
accomplished with sign changes and binary shifts, without the need for a
multiplier, and thus can be implemented quite efficiently.
Figure 4 shows the frequency response of this filter. It can be seen that
the filter provides a small amount of high-frequency attenuation and
therefore can be considered as a lowpass smoothing filter. The mea-
surements made on d(n) and on the smoothed signal ŝ(n) are zero-crossing
count, absolute energy, and difference between maximum and minimum
signal levels in the frame.
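
The measurements on the first-difference and smoothed signals can be sketched as follows. This is an illustrative sketch, not the original code: the function names are hypothetical, samples before the start of the signal are assumed to be zero, and "absolute energy" is assumed here to mean the sum of squared samples, which the text does not define.

```python
import numpy as np

# Taps of eq. (9) for delays 0..10; every tap is 0, +/-1, 2, or 4.
SMOOTHING_TAPS = np.array([-1, 0, 1, 2, 4, 4, 4, 2, 1, 0, -1], dtype=float)


def first_difference(s):
    """d(n) = s(n) - s(n-1), assuming s(-1) = 0."""
    return np.diff(np.asarray(s, dtype=float), prepend=0.0)


def smooth(s):
    """Smoothed signal of eq. (9), assuming zero samples before n = 0."""
    s = np.asarray(s, dtype=float)
    return np.convolve(s, SMOOTHING_TAPS)[: len(s)]


def zero_crossings(x):
    """Number of sign changes between adjacent samples (zero counts as positive)."""
    x = np.asarray(x, dtype=float)
    return int(np.count_nonzero(np.signbit(x[1:]) != np.signbit(x[:-1])))


def frame_measurements(x):
    """Zero-crossing count, energy (assumed sum of squares), and max minus min."""
    x = np.asarray(x, dtype=float)
    return {
        "zero_crossings": zero_crossings(x),
        "energy": float(np.sum(x * x)),
        "peak_to_peak": float(np.max(x) - np.min(x)),
    }
```

Applying `frame_measurements` to a 100-sample frame of `first_difference(s)` and of `smooth(s)` would give analogues of parameters 65-67 and 68-70, respectively, under the assumptions above.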



* This filter as well as parameters 65-70 were suggested by D. R. Reddy for inclusion in 
this work. 

Fig. 5. Probability distributions for log energy of the signal for silence, unvoiced, and
voiced classes. Both estimated and gaussian fits to the distributions are shown.

Once the initial set of 70 parameters was chosen, a training set of data 
was used to estimate one-dimensional probability functions for each of 
the parameters and for each signal classification. A one-dimensional 
gaussian curve having the same mean and standard deviation as the 
measured distributions was also computed for each parameter. Figures 
5 through 7 show three typical distributions for the parameters log energy 
(feature 61), first LPC coefficient (feature 1), and twelfth LPC coefficient 
(feature 12), respectively. For the log-energy parameter (Fig. 5), the 
distributions for silence, unvoiced, and voiced speech were fairly well 
separated with means of 18, 34, and 49 dB, respectively. Similarly the 
distributions for the first LPC coefficient (Fig. 6) were also well separated 
with means of -0.19, -0.66, and -1.9 for silence, unvoiced, and voiced 
speech, respectively. However, as shown in Fig. 7, not all parameters
had distributions that were well separated across the different classes. In
this case, the distributions for all three signal classes overlapped con-
siderably. It seems reasonable that features in which such behavior is
observed will not be effective in the classification procedure. Therefore,
such parameters were not considered in the testing to be described in
this paper.

* For the distance metric used in this work, it is not critical that the one-dimensional
distributions of the parameters be well approximated by a simple gaussian curve. It is
important, however, that the distributions be unimodal.

Fig. 6. Probability distributions for first LPC coefficient for silence, unvoiced, and voiced
classes.
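
The one-dimensional distribution estimate and its gaussian fit can be illustrated with a short sketch. The histogram estimator, the bin count, and the function names are assumptions made for illustration; the text does not say how the measured distributions were estimated.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Gaussian density with the given mean and standard deviation."""
    x = np.asarray(x, dtype=float)
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def estimate_distribution(values, bins=20):
    """Histogram density of one parameter for one class, plus the matching gaussian.

    Returns bin centers, the estimated density, the gaussian curve evaluated at
    the centers, and the sample mean and standard deviation.
    """
    values = np.asarray(values, dtype=float)
    density, edges = np.histogram(values, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mean, std = values.mean(), values.std()
    return centers, density, gaussian_pdf(centers, mean, std), mean, std
```

Running `estimate_distribution` once per class on the same parameter and overlaying the three curves gives the kind of comparison shown in Figs. 5 through 7: well-separated means relative to the spreads suggest a useful feature, while heavily overlapping curves suggest one that can be dropped.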

A total of 34 of the 70 parameters were eliminated in this manner. The
parameters eliminated were the higher LPC coefficients [a(5) to a(12)],
the higher autocorrelation coefficients [φ(0,5) to φ(0,12)], the higher
parcor coefficients [k(5) to k(12)], the higher cepstral coefficients [c(5)
to c(12)], and the last two partial normalized LPC error coefficients
[E(11) and E(12)]. The remaining 36 parameters were used in all the
optimization tests described in the next section.

2.2 Knockout optimization procedure 

To choose the set of five parameters out of the remaining 36 features
that best (most accurately) classified signal intervals as silence, unvoiced,
or voiced speech, a knockout optimization procedure was used. 7 Figure
8 shows a flow diagram of the procedure. Using a testing set of data (see
Section 2.4) and an objective error measure, the knockout optimization
proceeded first to find the single best parameter for separating the three
classes. The best parameter was knocked out and used in combination with
each of the remaining 35 features to find the best pair of parameters for
the signal classification. This process of knocking out the best parameter
and combining all knocked-out features with the ones remaining in the
parameter set was iterated until a total of five parameters were ob-
tained.

Fig. 7. Probability distributions for twelfth LPC coefficient for silence, unvoiced, and
voiced classes.

Several comments should be made about this procedure. First, it is 
noted that this method does not necessarily yield the optimum set of five 
parameters for making the silence, unvoiced, voiced decision. In general, 
the resulting parameter set is suboptimal since only a very small subset 
of the total number of combinations of 36 parameters taken five at a time 
are considered in this method. In defense of the method, however, one 
can argue that, within the constraints of the procedure, an optimal set 
of the 36 parameters is chosen. Furthermore, at least theoretically, the 
addition of each new knocked-out feature reduces the error score. Finally, 
it is argued that the resulting feature sets provide significantly better 
accuracy for signal classification than almost any randomly chosen set 
of five of the 36 parameters. 
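
The knockout procedure is a greedy forward selection, which can be sketched as below. The names `knockout_selection`, `error_fn`, and `subsets_evaluated` are hypothetical, and `error_fn` stands in for whatever routine trains and tests the classifier on a given feature subset and returns its error score.

```python
from math import comb


def knockout_selection(n_features, error_fn, n_select=5):
    """Greedy selection: at each step keep the feature that gives the lowest
    error when combined with the features already knocked out."""
    selected, history = [], []
    remaining = list(range(n_features))
    for _ in range(n_select):
        scores = {f: error_fn(selected + [f]) for f in remaining}
        best = min(remaining, key=scores.get)  # ties go to the lowest index
        selected.append(best)
        remaining.remove(best)
        history.append(scores[best])
    return selected, history


def subsets_evaluated(n_features, n_select):
    """Number of feature subsets the greedy search scores."""
    return sum(n_features - k for k in range(n_select))


# Greedy search over 36 features versus exhaustive search over all 5-subsets.
GREEDY_COUNT = subsets_evaluated(36, 5)   # 36 + 35 + 34 + 33 + 32
EXHAUSTIVE_COUNT = comb(36, 5)
```

The two counts make the suboptimality remark concrete: the greedy search scores 170 subsets, whereas an exhaustive search over five-parameter sets would have to score 376,992.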

2.3 Distance computation 

Fig. 8. Flow chart of knockout optimization algorithm.

The distance computation used throughout this investigation was the
non-Euclidean distance metric defined in Ref. 1. For the feature vector
x = [x(1), x(2), ..., x(5)] with mean vector m_i = [m_i(1), m_i(2), ..., m_i(5)]
and covariance matrix Λ_i, the distance computation was of the form

$$d_i = (x - m_i)\, \Lambda_i^{-1}\, (x - m_i)^t, \tag{10}$$

where i = 1 (silence), 2 (unvoiced), or 3 (voiced), and t denotes the
transpose of a vector. For each signal class, d_i is computed and the de-
cision rule is to select class i such that d_i < d_j for all j ≠ i; i.e., choose the
class with the minimum distance to vector x.
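
The minimum-distance rule can be sketched as follows. This is a minimal illustration with hypothetical function names; class indices 0, 1, and 2 in the code correspond to i = 1 (silence), 2 (unvoiced), and 3 (voiced) in the text, and the class statistics are assumed to be estimated from labeled training vectors.

```python
import numpy as np


def class_statistics(training_vectors):
    """Mean vector and covariance matrix of one class (rows are feature vectors)."""
    x = np.asarray(training_vectors, dtype=float)
    return x.mean(axis=0), np.atleast_2d(np.cov(x, rowvar=False))


def distance(x, mean, cov):
    """d_i = (x - m_i) cov^{-1} (x - m_i)^t, computed without forming the inverse."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    return float(diff @ np.linalg.solve(cov, diff))


def classify(x, class_stats):
    """Return the index of the class with the smallest distance, and all distances."""
    distances = [distance(x, mean, cov) for mean, cov in class_stats]
    return int(np.argmin(distances)), distances
```

Because the covariance matrix enters through its inverse, the distance is unchanged when the feature vector, the mean, and the covariance are all transformed by the same nonsingular linear map, which is the invariance property that motivated keeping this metric.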

Implementing the distance computation of eq. (10) during the
knockout optimization required the computation of a new covariance
matrix Λ_i for each subset of parameters being considered. Thus, on the
order of 420 covariance matrices had to be estimated in a typical opti- 
mization run. This represented a substantial amount of computation. 

2.4 Experimental procedure 

The formal evaluation of the feature sets was made by choosing a fairly 
large data base of different utterances and different speakers, using part 
of the data base for training the system, and using the remainder of the 
data base for testing the system. 

A total of five speakers (two male, three female) were used in the
telephone-line evaluation. Each speaker recited three utterances,* each
one over a new dialed-up connection, thereby ensuring that a dif-
ferent telephone-transmission path was obtained for each utterance. Two
different telephone transmitters (carbon microphones) were also used 
in the test. One utterance from each speaker was used in the training set; 
the remaining two utterances were used in the testing set. 

For each recorded utterance, a manual analysis was performed on each 
10-ms interval to classify it as voiced, unvoiced, or silence based on both 
the acoustic waveform and a phonetic transcription of the utterance. 
Each signal classification was further modified with a label as to the 
certainty of the manual classification. The labels used were: 

(i) Absolutely certain: clear characteristics of the class to which it
was assigned.
(ii) Moderately certain: generally a boundary interval between classes
in which two types of signal were present.
(iii) Uncertain: classified primarily on linguistic information about
the utterance. Included in this class were voiced fricatives, voiced
stops, and certain transients (including some telephone-line tran-
sients).

Figure 9 shows an example in which uncertain intervals occurred. This 
section of speech is from the beginning of the word cowboys. The initial 
intervals should linguistically be labelled as either silence or unvoiced 
speech corresponding to the stopgap and burst of the voiceless stop /k/. 
However, acoustically the initial seven intervals (as marked in Fig. 9) 
show properties more similar to voiced speech than to silence or unvoiced 
sounds. These intervals were treated as uncertain intervals and were 
marked as unvoiced speech for testing purposes. 

For the training set, only those intervals for which the classification 
was absolutely certain were used. For the testing set, three sets of data 
were used. One set contained only those intervals for which the classi-
fication was absolutely certain (TS1). The second set contained both the
moderately certain as well as the absolutely certain intervals (TS2). The 
third set contained all the intervals, regardless of the certainty of manual 
classification (TS3). In the next section, we present results for each of 
these testing sets of data. 
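
The nesting of the three testing sets can be made explicit with a small sketch. The `LabeledFrame` type, the numeric certainty codes, and `select_frames` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass

# Certainty labels of the manual classification, coded so that a lower
# number means a more certain label (coding is an assumption of this sketch).
ABSOLUTELY_CERTAIN, MODERATELY_CERTAIN, UNCERTAIN = 1, 2, 3


@dataclass(frozen=True)
class LabeledFrame:
    signal_class: str  # "s" (silence), "u" (unvoiced), or "v" (voiced)
    certainty: int     # one of the three certainty codes above


def select_frames(frames, max_certainty_code):
    """Keep frames whose manual label is at least as certain as the given code."""
    return [f for f in frames if f.certainty <= max_certainty_code]
```

With the frames of the testing utterances in `test_frames`, the sets TS1, TS2, and TS3 would be `select_frames(test_frames, 1)`, `select_frames(test_frames, 2)`, and `select_frames(test_frames, 3)`, so that each set contains the previous one; the training set would be `select_frames(train_frames, 1)`.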

III. RESULTS 

The knockout optimization procedure described in Section II was run
on the three sets of test data using 10 different error-weighting matrices.
In addition, the entire experiment was rerun on data preprocessed using
the two-pole spectral normalization method described in Section II.
Table I provides a summary of the three test sets of data, the 10 error-
weight matrices, and the two processing conditions.

* Each utterance was a carefully chosen sentence containing a mixture of voiced, un-
voiced, and silence intervals.

Fig. 9. The acoustic waveform for a section of speech in which an interval was uncer-
tain.

The error-weight matrices were used to study the effects of weights 
for each type of classification error on the overall error rate and the choice 
of the optimal features. The definition of a general error-weight matrix 
is as follows. If we let E denote the overall error score in classifying the 
data of a test set, then 

$$E = N_{ss}W_{ss} + N_{su}W_{su} + N_{sv}W_{sv} + N_{us}W_{us} + N_{uu}W_{uu} + N_{uv}W_{uv} + N_{vs}W_{vs} + N_{vu}W_{vu} + N_{vv}W_{vv}, \tag{11}$$

where N_ab is the number of frames of class a that were classified as
belonging to class b, and W_ab is the weight attached to this pair of clas-
sifications. It should be clear from eq. (11) that



$$\begin{aligned} N_s &= N_{ss} + N_{su} + N_{sv} \\ N_u &= N_{us} + N_{uu} + N_{uv} \\ N_v &= N_{vs} + N_{vu} + N_{vv}, \end{aligned} \tag{12}$$

where N_a is the number of frames in the test set in class a. Table II shows
the 10 weight matrices described in Table I. Each matrix is expressed
in the form

$$W = \begin{bmatrix} W_{ss} & W_{su} & W_{sv} \\ W_{us} & W_{uu} & W_{uv} \\ W_{vs} & W_{vu} & W_{vv} \end{bmatrix}, \tag{13}$$

where W_ab is not generally the same as W_ba.

Table I. Summary of factors considered in the investigation

Data test sets
    TS1     Absolutely certain intervals
    TS2     Moderately certain intervals added to TS1
    TS3     Uncertain intervals added to TS2

Error-weight matrices
    WM1     Uniform matrix
    WM2     Silence weighting matrix
    WM3     Unvoiced weighting matrix
    WM4     Voiced weighting matrix
    WM5     Silence-to-unvoiced weighting matrix
    WM6     Unvoiced-to-silence weighting matrix
    WM7     Silence-to-voiced weighting matrix
    WM8     Voiced-to-silence weighting matrix
    WM9     Voiced-to-unvoiced weighting matrix
    WM10    Unvoiced-to-voiced weighting matrix

Preprocessing
    P1      Direct transmission
    P2      Two-pole spectral normalization

As seen in Table II, error-weight matrix 1 (WM1) attaches equal weight
to all six types of misclassifications and, therefore, is the canonic error
matrix for the three-class problem. Error matrices 2-4 (WM2-WM4) each
choose a subset in which one of the three classes is essentially merged
with another class. For example, error matrix 4 (WM4) gives zero weight to
errors between the classes of silence and unvoiced speech; however, the
other four types of error have unity weight. Thus, this matrix serves to
distinguish most effectively between voiced speech and nonvoiced (either
silence or unvoiced) speech. As another example, error matrix 2 (WM2)
gives zero weight to errors between the classes of voiced and unvoiced
speech. Thus, this matrix serves to distinguish between speech (voiced
or unvoiced) and silence. As such, it would be useful for speech-detection
applications. Error matrices 5 through 10 each focus on only one of the
six sets of misclassifications. The results for these cases give a lower
bound on the error rate for special cases in which only a single type of
misclassification is considered.

For each of the sets of data of Table I, the knockout optimization 
procedure was used giving the five best features and the resulting overall 
misclassification rate, defined as 



$$E_N = \frac{E}{N_s + N_u + N_v}. \tag{14}$$
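
Table II. Error-weight matrices used in the investigation

Each matrix is laid out as in eq. (13): rows give the actual class and
columns the assigned class, in the order silence, unvoiced, voiced.

    WM1 (a)     WM2 (b)     WM3 (c)     WM4 (d)     WM5 (e)
    [0 1 1]     [0 1 1]     [0 1 0]     [0 0 1]     [0 1 0]
    [1 0 1]     [1 0 0]     [1 0 1]     [0 0 1]     [0 0 0]
    [1 1 0]     [1 0 0]     [0 1 0]     [1 1 0]     [0 0 0]

    WM6 (f)     WM7 (g)     WM8 (h)     WM9 (i)     WM10 (j)
    [0 0 0]     [0 0 1]     [0 0 0]     [0 0 0]     [0 0 0]
    [1 0 0]     [0 0 0]     [0 0 0]     [0 0 0]     [0 0 1]
    [0 0 0]     [0 0 0]     [1 0 0]     [0 1 0]     [0 0 0]
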



For error weights 5 through 10 (where only a single misclassification was
counted) the overall misclassification rate was defined as

$$E_N = \frac{E}{N_a}, \tag{15}$$

where

$$N_a = \begin{cases} N_s & \text{for WM5 and WM7} \\ N_u & \text{for WM6 and WM10} \\ N_v & \text{for WM8 and WM9.} \end{cases} \tag{16}$$
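
The error scores of eqs. (11) and (14) through (16) reduce to elementwise operations on a 3 x 3 confusion matrix, as the following sketch shows. The function names are hypothetical, and the confusion counts used in the usage note are invented for illustration; they are not results from this study.

```python
import numpy as np

CLASSES = ("s", "u", "v")  # silence, unvoiced, voiced

# Uniform matrix WM1: unit weight on all six misclassifications.
WM1 = np.array([[0, 1, 1],
                [1, 0, 1],
                [1, 1, 0]], dtype=float)


def error_score(confusion, weights):
    """E of eq. (11); confusion[a][b] counts frames of class a assigned to class b."""
    return float(np.sum(np.asarray(confusion, dtype=float) * np.asarray(weights, dtype=float)))


def overall_rate(confusion, weights):
    """E_N of eq. (14): weighted errors divided by the total number of frames."""
    return error_score(confusion, weights) / float(np.sum(confusion))


def single_error_rate(confusion, actual, assigned):
    """E_N of eqs. (15)-(16): one misclassification type, normalized by N_a."""
    a, b = CLASSES.index(actual), CLASSES.index(assigned)
    confusion = np.asarray(confusion, dtype=float)
    return float(confusion[a, b] / confusion[a].sum())
```

For a hypothetical confusion matrix `[[90, 8, 2], [10, 85, 5], [0, 3, 97]]`, `overall_rate(..., WM1)` counts all 28 off-diagonal frames out of 300, while `single_error_rate(..., "s", "u")` divides the 8 silence-to-unvoiced errors by the 100 silence frames only.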



The results of these experiments are presented in Tables III through VI. 
Tables IV through VI present the misclassification rate results, and 
Table III gives both the parameter numbers and the mnemonics of the 
five parameters chosen by the optimization procedure. The results in 
these tables are presented sequentially; i.e., the results obtained using
only l of the five parameters (l = 1,2,3,4) are indicated in the appropriate
rows of the tables.

Two comments should be made about the data. In many cases, it was 
found that the overall misclassification rate did not monotonically de- 
crease as more features were knocked out of the parameter set. For these 

Table V. Error rates for telephone line inputs

TS2 without two-pole spectral normalization

                                  Weight Matrix
Parameter  WM1   WM2   WM3   WM4   WM5   WM6   WM7   WM8   WM9   WM10
1          16.2  5.8   12.2  4.5   2.4   0.5   2.6         1.8   2.7
2          11.7  5.4   7.4   4.3   0.8                     1.0   2.1
3          10.8        7.1   3.6                           0.7   1.3
4                            3.5                           0.5   1.1
5                            3.5


Weight Matrix 



Parameter WM1 WM2 WM3 WM4 WM5 WM6 WM7 WM8 WM9 WM10



TS2 with two-pole Spectral Normalization 



1 


18.1 


2 


13.4 


3 


12.4 


4 


11.6 


5 


11.5 



7.2 



14.4 


8.6 


10.4 


8.2 


9.4 


7.8 




7.4 




7.2 



2.9 



3.9 
1.1 



2.7 



o.;{ 


4.5 


9.5 





5.2 


6.1 




3.9 


5.0 




3.8 


. 4.2 




3.6 


3.7 



Size of Training and Testing Sets for TS2 (Number of Frames)

            S      U      V      Total
Training    207    210    539    956
Testing     375    378    1196   1949



cases data are presented up to the number of parameters at which the 
error rate kept decreasing. The second comment concerns the specific 
features knocked out in the optimization (as given in Table III). In many 
cases, a large number of features (other than the ones presented) pro- 
vided essentially the same overall misclassification rate as the feature 
that was knocked out. These cases are indicated by an asterisk after the 
feature number in these tables. For such cases, features other than the 
ones indicated in the table may be equally appropriate. 

IV. ANALYSIS OF THE RESULTS 

Several important observations can be made by examining carefully
the results of Tables III through VI. First, it can be seen by comparing
error rates for matrix WM1 to those for matrices WM2 through WM4 that
most of the overall error rate for the canonic error matrix was due to
misclassifications between the classes of silence and unvoiced speech*
(compare results for WM1 and WM3). This result is not unexpected,
since the band limiting of telephone speech has the most severe
effect on unvoiced sounds whose spectral components often fall above
the high-frequency cutoff of the telephone transmission.

* Further evidence of this result is given in Table VII, which shows a breakdown of the
error components. This table is discussed later in this section.

Based on the above result it would seem reasonable to compare error 
scores using error matrices 1 and 4. It can be seen from the tables that 
if one does not consider distinctions between silence and unvoiced 
speech, then an improvement of somewhat more than 2 to 1 in error score 
is obtained. For the case of absolutely certain classifications, an error 
rate of 1.9 percent is obtained for error matrix 4. For test sets TS2 and 
TS3, the error rate for error matrix 4 increases to 3.5 and 5.0 percent, 
respectively. 

The results using error matrix 2 (the speech detection matrix) show 
that, in the case of absolutely certain classifications (Table IV) an error 
rate of 4.3 percent is obtained. For test sets TS2 and TS3, the error rate 
for matrix 2 increases slightly to 5.4 and 5.6 percent, respectively. 

The results of using error matrices WM5-WM10 show that the most 
frequent misclassification occurs between silence and voiced speech in 
which error rates on the order of 2 to 3 percent were obtained for all three 
test cases. The problem here occurs during low-level sounds, such as 
voiced stops where the silence regions are often classified as voiced due 
to the presence of low-frequency components of the signal. Unfortu- 
nately, such signals do not fall neatly into either category and the deci- 
sion algorithm consistently classified them as voiced sounds whereas the 
manual classification was silence. 

Comparisons of the results of Tables IV, V, and VI showed that the 
error scores increased with the complexity of the test set as anticipated. 
However, it is difficult to attach too much meaning to the absolute error 
rates for TS2 and TS3, since the frames which were added constituted 
boundary frames and frames which were subject to classification error 
in the manual classification. The results are presented to provide in- 
formation as to the sizes of the increases in error rate that are to be ex- 
pected with such input test sets. 

The data of Table III (the optimal feature list) are also quite inter- 
esting. The influence of the weight matrix is evident by scanning across 
the rows of the table. Each weight matrix had its own set of optimal 
features, which were different from those of any other weight matrix. 
By scanning down the columns of this table, however, it is seen that the 
influence of the data test set was fairly weak in that the optimal-feature
set remained substantially the same across all three test sets.

An interesting result shown in Tables IV through VI is that the two- 
pole normalization scheme did not provide essentially any improvement 
in the classification accuracy across any of the test conditions studied. 
This result is a little surprising in light of the work of Itakura who found 
that it compensated for different telephone transmission conditions quite
adequately. 3 One possible reason for this result is that the non-Euclidean 
distance metric to some extent compensates automatically for the 
variable telephone transmission conditions by appropriate linear 
transformation of the feature space. Thus, for this classification method, 
the use of a two-pole spectral normalization is of little value. 
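
The argument above can be illustrated numerically. The sketch below assumes that the non-Euclidean metric is a covariance-weighted (Mahalanobis-type) distance, d = (x - m)^T W^{-1} (x - m), with the mean m and covariance W estimated from training frames. The two-feature training data, the test frame, the `channel` transformation, and all function names are hypothetical and chosen only for illustration.

```python
# Hypothetical two-feature illustration; none of these numbers come from the study.

def mean_and_cov(points):
    """Sample mean and 2x2 covariance of a list of (x, y) feature pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis_sq(x, mean, cov):
    """Covariance-weighted distance (x - m)^T W^{-1} (x - m) for a 2x2 W."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Closed-form inverse of the 2x2 matrix, applied to (dx, dy).
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def euclidean_sq(x, mean):
    """Ordinary squared Euclidean distance, for comparison."""
    return (x[0] - mean[0]) ** 2 + (x[1] - mean[1]) ** 2


def channel(p):
    """Assumed invertible linear distortion of the feature space plus an offset."""
    a = ((2.0, 0.5), (-1.0, 1.5))
    b = (0.3, -0.7)
    return (a[0][0] * p[0] + a[0][1] * p[1] + b[0],
            a[1][0] * p[0] + a[1][1] * p[1] + b[1])


train = [(1.0, 2.0), (2.0, 1.5), (3.0, 3.5), (4.0, 3.0), (2.5, 4.0)]
frame = (3.5, 1.0)

# Class statistics estimated before and after the assumed distortion.
mean, cov = mean_and_cov(train)
mean_t, cov_t = mean_and_cov([channel(p) for p in train])

d_original = mahalanobis_sq(frame, mean, cov)
d_transformed = mahalanobis_sq(channel(frame), mean_t, cov_t)
e_original = euclidean_sq(frame, mean)
e_transformed = euclidean_sq(channel(frame), mean_t)

print("covariance-weighted:", d_original, d_transformed)
print("Euclidean:          ", e_original, e_transformed)
```

The covariance-weighted distance is unchanged by the transformation, while the Euclidean distance is not. The invariance holds only when the class statistics are estimated from data that passed through the same transformation, and only to the extent that a transmission condition acts linearly on the features; both are assumptions of this sketch.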

An additional breakdown of the error analysis for the most important
error-weight matrices (WM1, WM2, and WM4) is given in Table VII, in
which the percentage of each type of misclassification is presented. It
can be seen in this table that certain types of errors dominated the scores.
For example, no cases occurred throughout the test in which a voiced
interval was classified as silence. It can also be seen that, as mentioned
previously, the error rate for silence-to-unvoiced speech dominated the
overall error rates for error matrices WM1 and WM2, whereas no single
component of the error dominated the overall error rate for matrix
WM4.
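
A breakdown of this kind amounts to counting how often each pair of manual and automatic labels disagrees, and a weighted score multiplies each such percentage by the corresponding entry of an error-weight matrix. A minimal sketch follows. The label sequences and both weight matrices are made-up illustrations, not the data or the WM1 to WM10 matrices of this study, and taking percentages relative to all frames is an assumption.

```python
from collections import Counter

CLASSES = ("S", "U", "V")  # silence, unvoiced, voiced
CONFUSIONS = [(m, a) for m in CLASSES for a in CLASSES if m != a]


def error_breakdown(manual, automatic):
    """Percentage of all frames in each (manual -> automatic) confusion."""
    if len(manual) != len(automatic):
        raise ValueError("label sequences must have equal length")
    counts = Counter((m, a) for m, a in zip(manual, automatic) if m != a)
    n = len(manual)
    return {pair: 100.0 * counts.get(pair, 0) / n for pair in CONFUSIONS}


def weighted_error(breakdown, weights):
    """Sum of the confusion percentages, each scaled by its weight."""
    return sum(weights.get(pair, 0.0) * pct for pair, pct in breakdown.items())


# Hypothetical frame labels (manual classification vs. algorithm output).
manual = list("SSSSUUUUVVVVVVVVSSUU")
automatic = list("SSUUUUSUVVVVVVVUSSUV")

# Hypothetical weights: all errors equal, and silence/unvoiced confusions ignored.
equal_weights = {pair: 1.0 for pair in CONFUSIONS}
voicing_only = dict(equal_weights)
voicing_only[("S", "U")] = 0.0
voicing_only[("U", "S")] = 0.0

breakdown = error_breakdown(manual, automatic)
for pair in CONFUSIONS:
    print(pair[0], "->", pair[1], breakdown[pair], "percent")
print("equal weights:", weighted_error(breakdown, equal_weights))
print("voicing only: ", weighted_error(breakdown, voicing_only))
```

Setting a weight to zero removes that confusion from the score entirely, which is why a feature set optimized under such a matrix has no incentive to separate the two classes involved.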

4.1 Comparison with wideband results

Although some numerical scores were presented in Ref. 1 for
misclassification rates using the analysis method on wideband (high-quality)
data, a set of companion results was obtained in this study to compare
and contrast the error results for wideband and telephone signals. Using
the identical procedures discussed in Sections II and III, a set of optimal
features and error rates was obtained for wideband test sets of signals.
The results of these runs are presented in Tables VIII and IX.
Comparisons of the error rate tables (VII and VIII) show the following:

(i) For error weight matrix WM1, the scores for wideband data were
from two to three times lower than for telephone data. This is due
to the vastly improved scores on the category of silence-to-unvoiced
errors. The error rates for many of the other possible
misclassifications were quite comparable.

(ii) For error weight matrix WM4, the scores for wideband data were only
slightly better than for telephone data, indicating that a
voiced-not-voiced decision can be made as reliably over a telephone line
as for high-quality inputs. However, the speech-not-speech decision is
much more difficult for telephone data than for wideband signals.

(iii) For error weight matrix WM2, the scores for wideband data were
from two to eight times lower than for telephone data. This result
is again due to the improved performance in discriminating between
silence and unvoiced speech for wideband data.

Table IX. Optimal features for wideband test sets and error-weight
matrices WM1, WM4, and WM2

Test Set    WM1           WM4           WM2

TS1         13 φ(0,1)     68 ES         64 ML
            70 MLS        25 k(1)       70 MLS
            14 φ(0,2)     69* NZS       61 LE
            61 LE         67* MLD       14 φ(0,2)
            68* ES                      68 ES

TS2         13 φ(0,1)     68 ES         64 ML
            70 MLS        25 k(1)       70 MLS
            14 φ(0,2)     16 φ(0,4)     61 LE
            61 LE         64 ML         68 ES
            68 ES                       67 MLD

TS3         13 φ(0,1)     68 ES         64 ML
            70 MLS        15 φ(0,3)     70 MLS
            14 φ(0,2)     26 k(2)       61 LE
            61 LE         13 φ(0,1)     68 ES
            68 ES         52 c(4)       67 MLD

4.2 Typical test example 

Figures 10 and 11 show the results of applying the classification 
method to the utterance, "Few thieves are never sent to the jug," spoken 
by a male speaker. The contour shown in (a) of each figure is a manual 
classification of each frame. Part (b) shows the results of analysis using 
parameters obtained from WM1 (Fig. 10) and WM4 (Fig. 11). Part (c)
shows the results of nonlinearly smoothing the analysis contours using 
a median smoother. 8 Parts (d), (e), and (f) show plots of the probability 
of correct classification based on the distance calculation for each class; 
i.e., if we denote the distance calculated for silence as D_s, the distance
calculated for unvoiced as D_u, and the distance calculated for voiced as
D_v, then

    P(S) = \frac{D_u D_v}{D_s D_u + D_s D_v + D_u D_v},    (17)

    P(U) = \frac{D_s D_v}{D_s D_u + D_s D_v + D_u D_v},    (18)

    P(V) = \frac{D_s D_u}{D_s D_u + D_s D_v + D_u D_v}.    (19)

It can be seen that P(S), P(U), and P(V) define a probability measure,
since

    P(S) + P(U) + P(V) = 1    (20)

and

    0 \le P(S), P(U), P(V) \le 1    (21)

for all values of D_s, D_u, and D_v. Furthermore, the probabilities satisfy
the relations

    \lim_{D_\alpha \to 0} P(\alpha) = 1    (22)

and

    \lim_{D_\alpha \to \infty} P(\alpha) = 0.    (23)

Thus, as the distance increases, the probability measures decrease.

Fig. 10. The analysis results for the utterance, "Few Thieves are Never Sent to the Jug,"
using optimal features from TS1 with weight matrix WM1.
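
The probability measures of Eqs. (17) to (19) are straightforward to compute from the three class distances. The following is a minimal sketch; the function name and the example distance values are hypothetical, and the distances are assumed to be nonnegative with at most one of them equal to zero.

```python
def class_probabilities(d_s, d_u, d_v):
    """Map the class distances to (P(S), P(U), P(V)) as in Eqs. (17)-(19)."""
    denom = d_s * d_u + d_s * d_v + d_u * d_v
    if denom <= 0.0:
        raise ValueError("at least two distances must be strictly positive")
    return (d_u * d_v / denom,   # P(S), Eq. (17)
            d_s * d_v / denom,   # P(U), Eq. (18)
            d_s * d_u / denom)   # P(V), Eq. (19)


# Hypothetical frame that lies closest to the voiced class.
p_s, p_u, p_v = class_probabilities(8.0, 4.0, 1.0)
print(p_s, p_u, p_v, p_s + p_u + p_v)

# Limiting behaviour of Eqs. (22) and (23) for the silence class.
print(class_probabilities(0.0, 4.0, 1.0)[0])   # distance to silence is zero
print(class_probabilities(1e9, 4.0, 1.0)[0])   # distance to silence is very large
```

When all three distances are positive, dividing numerator and denominator by D_s D_u D_v shows that each probability is proportional to the reciprocal of its own distance, so the class with the smallest distance always receives the largest probability.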

Fig. 11. The analysis results for the same utterance as Fig. 10 using optimal features from
TS1 with weight matrix WM4.

Contrasting the silence-unvoiced-voiced contours of Figs. 10 and 11,
the following observations can be made:

(i) The results obtained using features derived from matrix WM4
essentially never classified frames as silence. Instead, all silence frames
were classified as unvoiced, consistent with the zero weight given
to this type of error.

(ii) Both sets of results contain only a small number of
misclassifications of voiced intervals. All but one of these voiced
misclassifications occurred at boundaries between voiced and nonvoiced
speech.

(iii) The probability measures for voiced speech using features derived
from WM4 were somewhat higher throughout the voiced regions
than corresponding results derived from WM1 features. This
indicates that a somewhat better feature set for voiced sounds is
obtained at the cost of a high error rate for silence-to-unvoiced
errors (and vice versa).

Results similar to those discussed above have been obtained for a wide 
variety of utterances tested on the system using these sets of features. 
It is concluded that if one is willing to forego any attempt at
distinguishing between unvoiced sounds and silence, then reliable
voiced-nonvoiced decisions can be obtained over telephone lines.
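
The nonlinear smoothing shown in part (c) of Figs. 10 and 11 can be illustrated with a simple running median over the frame-by-frame decisions. The sketch below is a stand-in under stated assumptions, not the smoother of Ref. 8: the numeric coding (0 = silence, 1 = unvoiced, 2 = voiced), the window length of 5, the end handling (edge values repeated), and the example contour are all hypothetical choices.

```python
from statistics import median_low


def median_smooth(contour, window=5):
    """Running median of an integer-coded decision contour.

    The window must be odd; the ends are padded by repeating the
    first and last values so the output has the same length.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    if not contour:
        return []
    half = window // 2
    padded = [contour[0]] * half + list(contour) + [contour[-1]] * half
    # median_low always returns one of the input labels.
    return [median_low(padded[i:i + window]) for i in range(len(contour))]


# Hypothetical contour with an isolated "voiced" frame inside silence
# and an isolated "unvoiced" frame inside a voiced region.
raw = [0, 0, 0, 2, 0, 1, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 0, 0]
smoothed = median_smooth(raw)
print(raw)
print(smoothed)
```

A median removes isolated single-frame decisions without blurring genuine transitions the way a linear filter would, although it can shift a boundary by a frame, as happens at the silence-to-unvoiced transition in this example.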

V. SUMMARY 

Through a series of extensive tests, we have investigated the potential
of a fairly sophisticated silence-unvoiced-voiced classification system.
We have shown that, depending on the weight
attached to various types of misclassifications, a set of optimal features 
can be found that minimizes the weighted misclassification error rate. 
For telephone line inputs, the results showed that reliable discrimination 
between silence and unvoiced sounds is quite difficult; however, reliable 
discrimination between voiced and nonvoiced sounds (silence or un- 
voiced speech) can be achieved at error rates fairly close to those obtained 
with wideband input signals. 

Extensive testing of the optimal feature sets obtained from the analysis 
showed the method to be reliable enough for use in several typical ap- 
plications in the area of man-machine communication by voice. 9,10 

One aspect of the analysis system which was not varied was the dis- 
tance metric used in the final classification. Although the non-Euclidean 
distance metric is a very powerful one for the features studied, other 
distance metrics have been proposed based on fixed parameter sets, such 
as the LPC parameters, etc. 3,11 Investigations into the applicability of
such distance metrics to the silence-unvoiced-voiced classification
problem are currently in progress.